Identifying Land Use Land Cover(LULC) with Landsat Image using Machine Learning Methods


Satellite image classification using PCA, K-means clustering, Random Forest, Bagging, Classification Tree, Logistic Regression, and Support Vector Machine(SVM)


This project investigates various machine learning techniques in predicting Land Use/Land Cover(LULC) classes. For this analysis, I have chosen Knoxville, TN, as a case study site and utilized a Landsat image dataset obtained from USGS Earth Explore. This project is centered around two primary objectives; first, identifying the most effective predictors for classifying LULC using two most widely recognized unsupervised learning algorithm - Principal Component Analysis(PCA) and K-means clustering; and second, automatically delineating specific land use types - urban, forest, cropland, and water - through various supervised learning algorithms including Random Forest, Bagging, Classification Tree, Logistic Regression, and Support Vector Machine(SVM).

This project has yielded several key insights: 1) all supervised methods demonstrated comparable performance regarding producer’s accuracy rates; 2) while these algorithms effectively identified water bodies and forests, they exhibited a significant level of confusion in differentiating between cropland and urban areas; 3) the classification tree, with its visual representation, provided the most interpretable results, although it was slightly less accurate than Random Forest method in terms of misclassification rates.;4) among all predictors employed in the study, band1 and the Normalized Difference Vegetation Index(NDVI) emerged as the most critical variables for determining LULC types.


Project Outline


Setting Up a Working Environment

For this project, I utilized the raster and rgdal packages primarily for image processing; tidyverse and knitr for data wrangling; the vegan and MASS packages for data processing; ggplot for data visualization; and randomForest, rpart, and e1071 for implementing machine learning algorithms.

library(raster)
library(rgdal)
library(tidyverse)
library(vegan)
library(MASS)
library(knitr)
library(ggplot2)
library(randomForest)
library(rpart)
library(e1071)
setwd('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/stat577_ML_project/final project/analysis & data')

Data Description and Pre-processing

The Landsat image of Knoxville utilized in this project is derived from Landsat5 data, which was downloaded from USGS Earth Explorer. The original resolution of the image is 30m by 30m, but it has been downsampled to 90m by 90m to reduce the overall size of the dataset. The image dimensions are 498 pixels in width, 551 pixels in height, and consist of 7 bands. The figure below displays the Knoxville Landsat image from February that I employed for the analysis.



The intensity per band is what will be used in a multivariate data driven approach to determine the land cover classifications. The R package Raster is used to extract these values from Landsat image in .tif format. The entire file contains 1,916,929 individual readings.



The seven band responses (as shown in the figure above) appear to be normally distributed. The means, medians, and standard deviations of each band are as follows:


Band Mean Median Standard Deviation
Band-1 53.57 52 7.91
Band-2 23.66 23 5.41
Band-3 25.64 24 8.28
Band-4 39.14 39 12.49
Band-5 65.79 66 23.73
Band-6 101.89 102 5.20
Band-7 30.72 31 11.10
Table 1. Band Intensity Description


To create the categorical response variable comprising four land use types - determined by the analysis results of unsupervised learning, as detailed in the section below - I employed the drawPoly function from the raster package to draw polygons on the Landsat image, which served as training datasets for urban areas, forests, croplands, and waterbodies. The figure below illustrates the polygons used to create the training datasets. The red polygons on the map represent urban areas, while the green polygons indicate forested regions. Blue areas capture water bodies, and the brown polygons denote croplands


#Interactive zooming in. Plotting band3 of feb (example)
newextent <- zoom(feb[[3]]) 
plotRGB(feb, r=3, g=2, b=1, ext=newextent, stretch='lin')

urb1 <- drawPoly(sp=TRUE, col='red', lwd=2)
urb2 <- drawPoly(sp=TRUE, col='red', lwd=2)
urb3 <- drawPoly(sp=TRUE, col='red', lwd=2)
urb4 <- drawPoly(sp=TRUE, col='red', lwd=2)
forest1 <- drawPoly(sp=TRUE, col='red', lwd=2)
forest2 <- drawPoly(sp=TRUE, col='red', lwd=2)
crop1 <- drawPoly(sp=TRUE, col='red', lwd=2)
crop2 <- drawPoly(sp=TRUE, col='red', lwd=2)
crop3 <- drawPoly(sp=TRUE, col='red', lwd=2)
water1 <- drawPoly(sp=TRUE, col='red', lwd=2)
water2 <- drawPoly(sp=TRUE, col='red', lwd=2)#portion of TN river
water3 <- drawPoly(sp=TRUE, col='red', lwd=2) #Douglas lake
save(urb1, urb2, urb3, urb4, forest1, forest2, crop1, crop2, crop3,
     water1, water2, water3, file = 'train_dh_poly.Rdata')

#Overlaying polygon training regions on top of Feb image
load("train_dh_poly.Rdata")
plotRGB(feb, 3, 2, 1, stretch="lin")
plot(urb1, add=TRUE, border="red")
plot(urb2, add=TRUE, border="red")
plot(urb3, add=TRUE, border="red")
plot(urb4, add=TRUE, border="red")
plot(forest1, add=TRUE, border="green")
plot(forest2, add=TRUE, border="green")
plot(crop1, add=TRUE, border="brown")
plot(crop2, add=TRUE, border="brown")
plot(crop3, add=TRUE, border="brown")
plot(water1, add=TRUE, border="blue")
plot(water2, add=TRUE, border="blue")
plot(water3, add=TRUE, border="blue")

#creating training dataset
lsat.labels <- rep(NA, ncell(feb))
lsat.labels[unlist(cellFromPolygon(feb,urb1))] <- "urban"
lsat.labels[unlist(cellFromPolygon(feb,urb2))] <- "urban"
lsat.labels[unlist(cellFromPolygon(feb,urb3))] <- "urban"
lsat.labels[unlist(cellFromPolygon(feb,urb4))] <- "urban"
lsat.labels[unlist(cellFromPolygon(feb,forest1))] <- "forest"
lsat.labels[unlist(cellFromPolygon(feb,forest2))] <- "forest"
lsat.labels[unlist(cellFromPolygon(feb,crop1))] <- "cropland"
lsat.labels[unlist(cellFromPolygon(feb,crop2))] <- "cropland"
lsat.labels[unlist(cellFromPolygon(feb,crop3))] <- "cropland"
lsat.labels[unlist(cellFromPolygon(feb,water1))] <- "water"
lsat.labels[unlist(cellFromPolygon(feb,water2))] <- "water"
lsat.labels[unlist(cellFromPolygon(feb,water3))] <- "water"

train.ids <- (!is.na(lsat.labels)) #Get a list of which rows are training data

In addition to the Landsat image, the Normalized Difference Vegetation Index (NDVI) is another widely used metric for assessing vegetation dynamics. NDVI can be computed from Landsat band 3 (visible red) and band 4 (near infrared). This index is particularly valuable for classifying different land cover types, especially in areas with vegetation. As such, the NDVI layer is also used in our study as one of the covariates for land cover classification. The computed NDVI layer is displayed below (top right). The distributions of NDVI for each land use type, as shown below (left and bottom right), reveal that cropland and water bodies are distinctly separated from other land use types, while urban areas and forests significantly overlap with one another. This overlap can be attributed to the presence of green spaces within urban areas.


ndvi <- overlay(feb$feb.3, feb$feb.4, fun=function(x,y){(y-x)/(x+y)})
par(mfrow=c(1,1))
plot(ndvi, main="NDVI")
covs <- addLayer(feb, ndvi) #combining feb layers and ndvi
names(covs) <- c("band1", "band2","band3","band4","band5","band6","band7","NDVI")
plot(covs)

all.data <- data.frame(labels=as.factor(lsat.labels), data.matrix(values(covs)))
tr_subset <- subset(all.data, train.ids)

#Visualizing the distribution of covariates for each class
mylist <- split(tr_subset, tr_subset$labels) #subsetting training data by each class
val_crop <- mylist$cropland
val_urb <- mylist$urban
val_forest <- mylist$forest
val_water <- mylist$water

par(mfrow = c(4,1))
hist(val_crop$band4, main="Cropland", xlab="NDVI", xlim=c(0,100),breaks=seq(0,100, by=5), col="orange")
hist(val_urb$band4, main="Urban", xlab="NDVI", xlim=c(0,100), breaks=seq(0,200, by=5), col="grey")
hist(val_forest$band4, main="Forest", xlab="NDVI", xlim=c(0,100),breaks=seq(0,100, by=5), col="green")
hist(val_water$band4, main="Water", xlab="NDVI", xlim=c(0,100), ylim=c(0,50),breaks=seq(0,100, by=1), col="blue")

#Scatter plot using band3 and 4 
ggplot(data=subset(all.data, train.ids))+
  geom_point(aes(x=band3, y=band4, color=labels), size=1, alpha=0.5)+
  guides(color=guide_legend(override.aes = list(size=3, alpha=1)))

I have integrated NDVI into the February Knoxville raster brick. The figure below displays all layers I used for LULC classification.


Unsupervised Learning: Data Driven Approaches for Classification

To determine the appropriate number of land use classes for the images, we first adopt a data-driven approach. This entails utilizing the seven bands of information from the images, rather than relying on visual inspection to classify different land use types such as agricultural, urban, or forested areas. In this data-driven approach, we will analyze the multivariate band responses using Principal Component Analysis (PCA) and K-means clustering techniques. The results from PCA will help visually identify the number of distinct land use types that may be relevant to the spectral data, while K-means clustering will quantify the number of clusters present in the spectral data, which can then be interpreted as the number of land use classes. This analysis is predicated on the assumption that the data are normally distributed???a premise that appears valid based on the band intensity data presented above.

Principal Component Analysis

set.seed(5)
test <- feb_bands[sample(nrow(feb_bands), 5000),]
test <- as.matrix(na.omit(test))
pca <- princomp(test, cor=T)
plot(pca$sdev^2/sum(pca$sdev^2), type="b", ylab="Proportion of Variance" ,xlab="Principal Compnents", main="Proportion of Variance for PCA")

a <- rda(test, scale = T)
biplot(a, type = c("text", "points"))

kable(as.data.frame(pca$loadings[,1:5]), digits=round(4))

par(mfrow=c(1,2))
biplot(a, display= "sites")
biplot(a, display= "sites", choices = c(1,3))

To complete the PCA, 5,000 randomly sampled pixels from the image were selected, as the entire dataset contains 1,916,929 individual readings. The training dataset was analyzed using the Vegan package for plotting purposes and the function princomp was employed to extract the results.

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
0.83 0.08 0.05 0.03 0.006 0.00227 0.00176
Table 2. PCA - Proportion of Variance


PCA targets and obtains uncorrelated components by focusing on variance. The scree plot suggests two PCs since the elbow shape occurs at PC=2. The first two PCs explain 91% of variance.


The first two PC???s are plotted where the points represent each randomly selected pixel and the arrows representing the scores of each band. Bands with the highest scores will be more effective in determining land use classes. From these results, the bands 1-3 & 7 have the highest scores along PC1 & bands 1-2, 4-5 along PC2. A table of the results is below.

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Band1 0.3714 0.5517 0.0922 0.0014 0.7034
Band2 0.3901 0.4045 0.1050 0.1433 0.3639
Band3 0.4015 0.2382 0.1578 0.0543 0.5734
Band4 0.3652 0.3981 0.0851 0.8000 0.1318
Band5 0.3784 0.4852 0.1963 0.2560 0.0161
Band6 0.3477 0.0597 0.9314 0.0870 0.0174
Band7 0.3890 0.2790 0.2055 0.5133 0.1613
Table 3. PCA Loadings

The PCA loadings indicate that all seven bands contribute nearly equally to the first principal component (PC). Band 1 exhibits significant loading on the second and fifth PCs, while band 4 is heavily loaded on the fourth PC, and band 6 shows substantial loading on the third PC. These results suggest potential clustering along the second and third axes. However, given the very low proportion of variance explained by all axes beyond the first principal component, it is unlikely that these groupings would be particularly meaningful, though they may still provide useful insights.

Examining the PCA distribution reveals three distinct groupings that could be utilized to identify different land classes. The first cluster is the dominant “blob” near the origin, while the second consists of a small number of points that are positive along PC1 and negative along PC2. The third grouping includes points that are negative along both PC1 and PC2. The primary cluster may represent a land use type that constitutes the majority of the image. The second grouping appears as a tight cluster within the test data and may correspond to a well-defined class, such as water. The final class of land use seems to be more dispersed and may encompass multiple classifications. While the results from the PCA are not definitive, they do identify three clusters of data. It appears that several bands are beneficial for classifying land cover types, with bands 1, 2, 3, and 5 being the most useful.

K-Means Clustering Analysis

###*K*-Means Sum of Squares
set.seed(3)
ss = rep(0,10)
for(i in 1:10)
{
  km = kmeans(test,i,nstart=10)
  ss[i] = km$tot.withinss
}

par(mfrow=c(1,2))
plot(1:10,ss,xlab = "k",ylab="total with-in ss",type='l') # plot of k versus the total with-in ss
ss_lag = ss[2:10]
ss_diff = ss[1:9] - ss_lag
plot(1:9,ss_diff,xlab = "k",ylab="reduction in total with-in ss",type='l') # plot of k versus the reduction in total with-in ss

###*K*-Means Results
kmeans.3 <- kmeans(na.omit(test),3,nstart = 20)
kmeans.4 <- kmeans(na.omit(test),4,nstart = 20)
par(mfrow=c(2,2))
plot(test[,1],test[,2],col=(kmeans.3$cluster),pch=20, cex=.8,xlab="X-1",ylab="X-2",main="k-means results with k=3")
plot(test[,4],test[,5],col=(kmeans.3$cluster),pch=20, cex=.8,xlab="X-4",ylab="X-5",main="k-means results with k=3")
plot(test[,1],test[,2],col=(kmeans.4$cluster),pch=20, cex=.7,xlab="X-1",ylab="X-2",main="k-means results with k=4")
plot(test[,4],test[,5],col=(kmeans.4$cluster),pch=20, cex=.7,xlab="X-4",ylab="X-5",main="k-means results with k=4")

K-means clustering is the second data-driven technique employed to identify data clusters. The kmeans function from the base R package was utilized to assess the reduction in Sum of Squares (SS) for up to 10 clusters using the training data. The results indicate a significant reduction in SS at 3 to 4 clusters, which aligns with the findings from the PCA. To visualize how these clusters are distributed along the axes of highest variance, plots were created for bands 1, 3, 4, and 5. The K-means results (shown in the figure below) are clearly defined along both sets of axes, suggesting that either 3 or 4 classes could be used for the final analysis, depending on the plotted results.

Supervised Learning Approaches for Classification

In this section, we explore five different supervised classification methods: Random Forest, Bagging, Regression Tree, Logistic Regression, and Support Vector Machine.

Random Forest

#decide ntree: we tried ntree=100,200,300,400,500,600,700,800,900,1000
#ntree=700 returns the smallest error for each of the four land use types

#fit model
modelRF <- randomForest(x=tr_subset[,c(2:9)], y=tr_subset$labels, ntree=700 
                        ,mtry=3 ,importance=TRUE)

#predicting land cover types by using all data.
predRF <- predict(covs, model=modelRF, na.rm=TRUE)

#visualizing predicted land cover classes
cols <- c("orange", "green", "grey", "blue")
dev.off()
par(mfrow=c(1,1))
plot(predRF, col=cols, legend=FALSE, main="Random Forest")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")
modelRF$confusion

#variable importance plot
varImpPlot(modelRF)

By fitting the Random Forest model with mtry set to 3 (the square root of p, where is 9) and ntree set to 700 (after testing ntree values of 100,200,300,400,500,600,700,800,900,and 1000, we found that ntree=700 returns the smallest error for each of the four land use types), we obtain the classification error as presented below.

Cropland Forest Urban Water Class Error
Cropland 371 15 71 0 0.19
Forest 5 6986 5 0 0.001
Urban 88 7 1633 0 0.055
Water 0 0 1 289 0.003
Table 4. Random Forest results

We can observe that the classification errors for water bodies and forests are very low, both below 1%. In contrast, over 5% of urban areas and nearly 20% of croplands are misclassified. The primary confusion occurs between croplands and urban areas, which aligns with our earlier visualizations in the previous sections. The figure below illustrates the different land use types predicted by the Random Forest model.

The variable importance plots for the Random Forest model display the mean decrease in accuracy and the decrease in the Gini impurity coefficient for each variable. In our analysis, it appears that band1 and NDVI significantly impact accuracy, while band1 and band2 score highest according to the Gini impurity criterion. For a large dataset like ours, this information can be beneficial, allowing us to exclude less important variables in future analyses to enhance classification accuracy. Additionally, we observe that band3 is deemed insignificant in this context, contrasting with our findings during the unsupervised learning phase. This discrepancy may be attributed to our integration of band3 and band4 into NDVI during data preprocessing.

Bagging

modelBag <- randomForest(x=tr_subset[,c(2:9)], y=tr_subset$labels, ntree=700 
                         ,mtry=8 ,importance=TRUE)
predBag <- predict(covs, model=modelBag, na.rm=TRUE)
plot(predBag, col=cols, legend=FALSE, main="Bagging")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")
modelBag$confusion
varImpPlot(modelBag)

With mtry set to 8 and ntree set to 700, we fit the Bagging model and obtained the classification errors displayed in Table 5. The classification error rates for forest and water bodies are identical to those from the Random Forest model, while the error rates for cropland and urban areas are higher in the Bagging model compared to the Random Forest results. The figure below illustrates the different land use types predicted by the Bagging model.

Cropland Forest Urban Water Class Error
Cropland 368 13 76 0 0.19
Forest 8 6986 2 0 0.001
Urban 93 7 1628 0 0.058
Water 0 0 1 289 0.003
Table 5. Bagging results

The variable importance plot below displays that band1 and NDVI are the most important predictors for both mean decrease accuracy and mean decrease Gini.

Regression Tree

modelTree <- rpart(labels~.,data=tr_subset, method="class")
plot(modelTree, margin=0.05)
text(modelTree, use.n=TRUE, cex=0.8, pretty=0)
predTree <- predict(modelTree, all.data, type="class")
tree.map <- raster(feb)
values(tree.map) <- predTree
cols <- c("orange", "green", "grey", "blue")
plot(tree.map, col=cols, legend=FALSE, main="Regression Tree")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")
conf.mat <- table(pred=predTree[train.ids], train=all.data[train.ids, "labels"])
conf.mat

From the confusion matrix below, it is evident that the Regression Tree classification does not perform as well as the Random Forest or Bagging models. The misclassification error rates reinforce our earlier findings, indicating that water and forest types are well-classified, while cropland is the most misclassified land use type, achieving an accuracy of only 76.5%. In contrast, the other land use types exhibit accuracy rates exceeding 90%.

Cropland Forest Urban Water Class Error
Cropland 350 4 103 0 0.234
Forest 20 6976 40 0 0.009
Urban 87 16 1585 23 0.074
Water 0 0 0 267 0.000
Table 6. Regression Tree results

However, the Regression Tree model among all the models utilized in this study provides the clearest interpretation through its graphical representation. As illustrated below, the regression tree indicates that if band1 is less than 53.3 and NDVI is greater than or equal to -0.2036, the land use type is predicted to be forest. The numbers displayed beneath each land use type indicate the count of correctly or incorrectly classified observations. For example, the notation ???0/0/0/267??? under ???water??? means that all 267 observations are correctly classified as water bodies, with zero cases of misclassification.

The figure below shows the land cover types predicted by the regression tree model.

Logistic Regression

#Do a logistic regression for each class
classes <- levels(all.data$labels)
C <- length(classes)
logit.class <- as.data.frame(matrix(NA, nrow(all.data), C))
names(logit.class) <- classes

#We will first fit the model for each class (vs. all the other classes)
#and then we will save the predicted probabilities of being in that class
for(c in classes){
  model.fit <- glm(I(labels==c)~., data=tr_subset, family="binomial")
  logit.class[[c]] <- as.vector(predict(model.fit, newdata=all.data[,-1],
                                        type="response"))
}
pred.logit <- apply(logit.class, 1, function(x) which(x==max(x)))
logit.map <- raster(jul) #initialize a map. we will overwrite the values
pred.logit <- factor(pred.logit, levels=1:4, labels=classes)
values(logit.map) <- pred.logit
plot(logit.map, col=cols, legend=FALSE, main="Logistic Regression")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")

#confusion matrix for logistic regression 
logit.conf.mat <-table(pred=pred.logit[train.ids], train=all.data[train.ids, "labels"])
logit.conf.mat

The confusion matrix for the Logistic Regression model indicates that this method better classifies urban areas and water bodies compared to the previous methods. However, it struggles to distinguish between cropland and forest, showing less effectiveness in this regard than the other models.

Cropland Forest Urban Water Class Error
Cropland 299 0 61 0 0.169
Forest 29 6990 14 0 0.006
Urban 129 6 1653 1 0.076
Water 0 0 0 289 0.000
Table 7. Logistic regression results

Shown below are the land use types predicted by Logistic Regression.

In our approach to fitting the Logistic Regression model, we fit one land use type against the other three types at a time. Below, we present the model fit with respect to cropland land use type prediction as an example. One of the advantages of the logistic model is that it provides more interpretable results. Specifically, we can easily determine how a one-unit change in a predictor variables correlates with changes in the response variable. For instance, we observe that NDVI is the most significant predictor for cropland, with a one-unit increase in NDVI resulting in a decrease of 23 units in the log odds of a type being classified as cropland. A similar interpretation can be applied to the other land use types by examining their corresponding results.

Support Vector Machine

train2.ids <- rep(FALSE, ncell(jul))
train2.ids[seq(from = 1, to = ncell(jul), by = 10)] <- train.ids[seq(from = 1,to = ncell(jul), by = 10)]

fit.svm <- svm(labels ~ ., data = all.data, subset = train2.ids, kernel = "linear")
pred.svm <- predict(fit.svm, newdata = all.data[, -1])

svm.map <- raster(jul)
values(svm.map)[complete.cases(all.data[, -1])] <- pred.svm
plot(svm.map, col = cols, legend=FALSE,main = "SVM")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")

conf.mat.svm <- table(pred = pred.svm[train.ids], train = all.data[train.ids,"labels"])
conf.mat.svm

Confusion matrix for SVM shows that SVM performs better than Classification Tree and Logistic Regression, but not as good as Random Forest and Bagging.

Cropland Forest Urban Water Class Error
Cropland 353 3 84 0 0.198
Forest 21 6988 21 0 0.006
Urban 83 5 1623 1 0.052
Water 0 0 0 289 0.000
Table 8. Support Vector Machine(SVM) results

Shown below is the land use type prediction by Support Vector Machine.

A Visual Comparison with K-means Clustering Result

valuetable <- getValues(covs)
head(valuetable)
km <- kmeans(na.omit(valuetable), center=3, iter.max=100, nstart=20)
rNA <- setValues(raster(covs),0)

for(i in 1:nlayers(covs)){
  rNA[is.na(covs[[i]])] <- 1
}

rNA <- getValues(rNA)
valuetable <- as.data.frame(valuetable)
valuetable$class[rNA==0] <- km$cluster
valuetable$class[rNA==1] <-NA
classes <- raster(covs)

classes <- setValues(classes, valuetable$class)
plot(classes, legend=FALSE, col=cols, main="K-means")
legend("bottomright", legend=c("cropland","forest", "urban","water"),
       fill=cols, bg="white")

To visually compare the predicted results, we have also included the results from the K-means clustering (unsupervised learning), as shown in the figure below. It is evident that the four land use types are significantly mixed up, highlighting the limitations of the unsupervised approach. In contrast, all five supervised learning methods discussed in the previous sections demonstrate significantly improved performance over the unsupervised method. Therefore, it appears that unsupervised learning techniques, such as K-means clustering, are not suitable for our dataset.

Summary of Supervised Learning Results

For our dataset, Random Forest outperforms the other methods in terms of confusion matrix and misclassification error rates. Water bodies and forests are the two land use types that are clearly distinguished, while croplands are somewhat mixed with urban areas. In terms of interpretability, classification trees offer the best results, as their tree plot clearly displays the outcomes in an understandable way, even though they do not perform as well as Random Forest regarding misclassification rates. Additionally, in terms of predictor importance in determining land use types, band1 and NDVI emerge as the two most significant predictors. Overall, all five supervised learning methods perform nearly equally well on our training dataset. As illustrated in the figure below, the classification accuracy for cropland is around 80%, while the accuracy for urban areas exceeds 92%. For both forest and water land use types, the classification accuracy is higher than 99%.

Project Summary

As we initially recognized during the exploratory visual analysis phase, there is significant overlap between the urban and cropland categories. A detailed comparison of our images with satellite imagery from sources like Google aerial photography shows that many low-density urban areas, such as suburbs, are mistakenly classified as forests or croplands. This issue arises because suburban categories were not included during the training phase. Including these categories in our training would have likely improved classification results. Apart from SVM, all methods show similar levels of effectiveness concerning user and producer accuracy. Future discussions should focus on exploring why some methods perform better than others and in what situations they excel. Generally, supervised classification approaches consistently outperform unsupervised methods.

To conclude, there are several strategies to enhance our results. First, we should improve the training phase by incorporating a wider range of training regions to accurately include various types of each category, such as high and low-density urban areas, as well as evergreen and deciduous forests. Second, we should use higher-resolution images with greater precision for training. Although the current study effectively separates forest, urban, and water categories, there is some confusion between cropland and urban areas, likely due to insufficient training methods. By increasing the number of training polygons, the model can have more learning resources, potentially improving classification results. Additionally, we have only employed February Landsat images for our training; integrating images from July, December, or utilizing combined layers like PCA may improve classification accuracy. Third, incorporating additional variables, such as Census demographic data (e.g., population), could help differentiate suburbs from cropland or forest areas, as satellite images alone cannot perfectly classify land cover. Depending on the research goals, certain features should be prioritized over others, and careful pre-planning is essential to ensure that these features accurately represent the specific categories of interest.